Contact:
Peter-Paul de Wolf
Statistics Netherlands
P.O. Box 24500
2490 HA The Hague
The Netherlands
Phone: +31 70 337 5060
Last update: 10 Oct 2011
OR-tools for tabular data (WP 4.1)
Leading partner: CBS
Participating partners: ULL, UPC
Task 1 (Responsibility: University of La Laguna)
Objectives
We aim to develop, analyse and implement algorithms for solving some optimisation
problems arising when protecting sensitive information in tabular data.
In particular, this task addresses the application of the cell suppression
methodology to hierarchical and linked tables.
Description of work
We will study several algebraic structures to capture the links between cells,
both in hierarchical and linked tables. This is a very relevant part, as we will
use Integer Programming to model the optimisation problems and then techniques
based on Linear Programming to solve them. The models need to capture the
structure of the tables in such a way that the gap between the linear programming
relaxation and the integer program is not too large. Otherwise, additional
constraints must be derived through polyhedral analysis to strengthen the linear
relaxation. Polynomially solvable cases and network relaxations must also be
studied for a better understanding of the problem.
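To fix ideas, a standard way of formulating the cell suppression problem as an
integer program (in the spirit of Fischetti and Salazar) can be sketched as
follows; the notation (cell values a_i, suppression costs w_i, table equations
A y = b, big-M constants M_i, upper protection levels UPL_s) is introduced here
only for illustration and is not fixed by this document:

    \begin{aligned}
    \min\ & \textstyle\sum_i w_i x_i \\
    \text{s.t. } & \text{for each sensitive cell } s \text{ there exists } y^s \text{ with} \\
    & A y^s = b,\quad |y^s_i - a_i| \le M_i x_i \ \ \forall i,\quad y^s_s \ge a_s + \mathrm{UPL}_s, \\
    & x_i \in \{0,1\} \ \ \forall i,\quad x_s = 1 \text{ for each sensitive cell } s,
    \end{aligned}

with lower protection levels handled symmetrically. The implicit existential
quantifier is what makes the model hard: the plain linear relaxation can be weak,
which is precisely why the polyhedral strengthening mentioned above matters.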
In this workpackage a junior programmer will help to organise our initial
instances from the CASC partners in a database and to compute statistics on them,
for a better understanding of the instances. It is also important at this phase to
implement pre-processing software that reduces the size of the instances as far as
possible (fixing variables, removing links). We will also set up a web page to
keep our partners and the Operations Research community informed about the
progress of our work in the CASC project. Some meetings with practitioner partners
will be needed for a better understanding of the structure of linked and
hierarchical tables, and for resolving questions that arise while modelling the
optimisation problems.
As the work is mainly oriented towards producing a software optimiser for
real-world instances, we will need some hardware (personal computers and printers)
and some software (compilers, memory checkers, and LP solvers). This material will
also be used intensively throughout the life of the project.
Mathematical Programming offers several general frameworks for ILP optimisation,
including the branch-and-bound approach, cutting-plane procedures, Lagrangean
relaxations, meta-heuristic algorithms, dynamic programming, etc. As a general
rule, the particular features of the model at hand require special adaptations of
these approaches to obtain solutions within acceptable computing times, even on
medium-size instances. Sometimes the best results are obtained by cleverly
combining different general procedures, such as branch-and-bound and cutting-plane
generation. Cell suppression on linked and hierarchical tables is known to be
very difficult from the computational point of view (NP-hard in the strong sense).
The experience from the past SDC project shows, however, that sophisticated
methods can be successful in most cases, at least for 2- and 3-dimensional tables.
The kernel of the algorithm developed there for finding an optimal solution is an
implicit enumeration scheme, generally called branch-and-bound. Pruning of the
choices that need not be explored explicitly is obtained by solving a sequence of
tighter and tighter relaxations of the initial ILP model.
For hierarchical and linked tables we will investigate an approach based on similar
ideas, but taking into account the increased difficulty coming from the new structures
to be handled.
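As a point of reference, the skeleton of such an implicit enumeration scheme can
be sketched as follows. This is a minimal illustration for a generic 0/1
minimisation problem, not the actual cell suppression code; it assumes SciPy's LP
solver and prunes a node whenever its relaxation bound cannot improve on the
incumbent:

    import numpy as np
    from scipy.optimize import linprog

    def branch_and_bound(c, A_ub, b_ub, eps=1e-6):
        """Minimise c'x s.t. A_ub x <= b_ub, x binary (illustrative sketch)."""
        best_val, best_x = np.inf, None
        stack = [[(0.0, 1.0)] * len(c)]      # root node: all variables free in [0, 1]
        while stack:
            bounds = stack.pop()
            res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
            if not res.success or res.fun >= best_val:
                continue                     # node infeasible, or pruned by its LP bound
            frac = [i for i, v in enumerate(res.x) if eps < v < 1.0 - eps]
            if not frac:                     # LP solution already integral: new incumbent
                best_val, best_x = res.fun, res.x.round()
                continue
            i = frac[0]                      # branch by fixing a fractional variable
            for fix in ((0.0, 0.0), (1.0, 1.0)):
                child = list(bounds)
                child[i] = fix
                stack.append(child)
        return best_val, best_x

In the real code the single LP relaxation at each node is replaced by the sequence
of progressively tightened relaxations described above.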
Even though we are concerned with finding optimal solutions to the problems at
hand, near-optimal solutions are necessary to guide and reduce the computational
effort spent within the implicit enumeration scheme. To this end, two kinds of
heuristic algorithms will also be addressed: one for obtaining an initial feasible
solution, and the other for providing near-optimal feasible solutions at each node
of the branch-decision tree. In particular, this second heuristic algorithm will
be invoked many times during the overall run, hence it must be very efficient.
Heuristics embedded within the enumeration scheme are also important because they
allow the user to stop the exact algorithm before convergence, the output in this
case being a (typically extremely tight) approximate solution together with an
indication of its distance from the optimum.
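A typical ingredient of the second kind of heuristic is rounding the fractional LP
point available at each node. A loose sketch (the audit check is supplied by the
caller, since it belongs to the real code):

    def lp_rounding_heuristic(x_lp, cost, is_protected):
        """Round an LP point x_lp (dict: cell -> fractional value) to a
        suppression pattern; is_protected(pattern) is the caller's audit check."""
        pattern = {i for i, v in x_lp.items() if v >= 0.5}
        # Greedily add the cheapest remaining cells until the pattern is protective.
        for i in sorted(x_lp, key=lambda i: cost[i]):
            if is_protected(pattern):
                return pattern
            pattern.add(i)
        # May still be unprotective here: see the infeasible subproblems below.
        return pattern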
Also very important for the success of the algorithm on large instances is the
pre-processing phase, aimed at reducing the size of the input instance that needs
to be considered explicitly. For example, when applying cell suppression it often
happens that some sensitive cells are already protected by the other primary
suppressions, hence they actually need no additional secondary suppressions.
Moreover, when an instance corresponds to a subproblem coming out of the
decomposition of a bigger instance (say, a linked table), it may even be
infeasible, i.e., the protection levels may be set up so that no feasible
suppression pattern exists. This situation creates serious problems, because it is
then difficult even to find approximate solutions, and it is a novel case not
studied in previous work. We will also address these challenging instances.
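The audit computation behind both remarks (detecting primaries that are already
protected, and detecting protection levels that no pattern can meet) can be
sketched with two LPs per sensitive cell, computing the narrowest and widest
values an attacker could deduce for it from the published cells. The matrix
notation and the non-negativity of cell values are assumptions of this sketch:

    import numpy as np
    from scipy.optimize import linprog

    def attacker_range(A_eq, b_eq, values, suppressed, s):
        """Interval an attacker can deduce for cell s, given the table
        equations A_eq y = b_eq and the set of suppressed cell indices."""
        n = len(values)
        bounds = [(0.0, None) if i in suppressed else (values[i], values[i])
                  for i in range(n)]         # published cells are fixed to their value
        c = np.zeros(n)
        c[s] = 1.0
        lo = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
        hi = linprog(-c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
        if not (lo.success and hi.success):
            return None                      # unbounded or inconsistent input data
        return lo.fun, -hi.fun               # attacker's interval [min, max] for cell s

If the interval already covers the required protection interval around the true
value, cell s needs no secondary suppressions; and if even the pattern that
suppresses every eligible cell fails this test, the subproblem admits no feasible
pattern at all.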
Because the arising optimisation problems will surely be very difficult to solve,
implementing the theoretical ideas properly and making them work will be
time-consuming. Some meetings of the junior and/or senior researchers with the
tester partners will be needed to fix the final difficulties in the software. At
month 30 the callable library must be embedded in τ-ARGUS, one of the overall
deliverables of the CASC project, and all newly arising difficulties will be
solved by month 35. In the meantime we will write and distribute scientific papers
containing the main ingredients of the developed methodologies.
The complete portability of the code to different computers with different
operating systems (and C compilers) will be ensured. The programming work will be
done on compatible PCs and will be extensively tested for programming errors using
professional debuggers and memory-checking software (Purify). Linear programming
relaxations of the ILP models will be solved using an external, robust
(commercial) LP solver. The performance of the final software depends strongly on
the data structures adopted to store the input table and the relevant information
internally. Alternative data structures will be proposed and compared
computationally on the SDC-lib instances collected from our partners; these
instances will also be used for code testing during all the programming phases.
The closer these instances are to real-world instances, the better our software
will perform. Therefore this work strongly depends on appropriate tables (of
small, medium, and large size) from our partners.
During the last part of our work we will listen to and incorporate the
corrections and suggestions offered by the partners testing our software.
Nevertheless, the main help from our partners will come early in the project, in
the form of real data. The software will aim to solve medium-size instances to
optimality in reasonable computing time on personal computers, although a precise
definition of "medium-size" cannot be given at the moment. From our experience we
know that the complexity of exact methods grows drastically with the dimension of
the table, so tables with cells in more than four dimensions tend to be really
difficult for exact approaches. We expect that the structure of real-world
instances (in general much easier than randomly generated instances) will be
captured by our approaches, so that our software can solve bigger tables.
Milestones and expected results
We expect to model mathematically the optimisation problems arising in
Statistical Disclosure Control on hierarchical and linked tables. We expect to
turn the theoretical ideas into code, solving the difficulties that usually arise
along the way. Finally, we expect to write and publish the main scientific results
of our work in top-level journals.
Task 2 (Responsibility: UPC)
Objectives
The aim of this task is twofold: firstly, to augment τ-ARGUS with an algorithm
for secondary cell suppression based on network flow algorithms; and secondly, to
attempt to improve this methodology with recently developed optimisation methods.
The first objective is not new: it has already been described and applied by other
authors (e.g. Cox). However, τ-ARGUS lacks such a methodology, which has proven to
be especially appropriate for large cell suppression problems. The optimisation
problems that appear are large-scale network flow problems. For big tables, even
the most efficient specialised network simplex algorithms can result in long
execution times. The second objective is directly related to reducing these
execution times by means of recently developed interior-point network flow
algorithms. For big networks, these specialised interior-point methods have been
shown to be more efficient than specialised simplex algorithms. It would also be
worthwhile to study other possible solution strategies. Among these we mention
only algorithms for network flow problems with side constraints, with nonlinear
objective functions, multicommodity flows, and semidefinite programming. In this
workpackage we will focus on multicommodity flow algorithms. Since the research
component of this approach is considerable, the efficiency of its implementation
(and even its availability) may be a difficult goal to achieve.
Description of work
The first goal of this task is to implement an algorithm for secondary cell suppression.
The algorithm would be based on the methodology described by Cox. Additional
refinements of the heuristic procedures suggested for interval cell disclosure
could be attempted. The algorithm would be implemented in C or C++,
and ready to be integrated within τ-ARGUS. This new option of τ-ARGUS should be
able to efficiently provide good solutions for big tables. An external network flow
solver should be used (e.g., Xpress, Cplex 6.0 or PPRN).
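To illustrate the network view underlying this methodology on a two-dimensional
table: rows and columns become nodes and internal cells become edges, so that
protecting a sensitive cell (i, j) amounts to finding a cheap alternative path
between row node i and column node j; together with the sensitive cell, this path
closes a cycle along which the table values can be perturbed. The sketch below
(using networkx for brevity, with zero cost for already-suppressed cells) shows
only this basic idea, not the full method with protection levels and capacities:

    import networkx as nx

    def protect_cell(costs, suppressed, i, j, n_rows, n_cols):
        """Pick secondary suppressions protecting sensitive cell (i, j)."""
        G = nx.Graph()
        for r in range(n_rows):
            for c in range(n_cols):
                if (r, c) == (i, j):
                    continue                 # exclude the sensitive cell itself
                w = 0.0 if (r, c) in suppressed else costs[r][c]
                G.add_edge(("row", r), ("col", c), weight=w, cell=(r, c))
        # Cheapest alternative path from row i to column j.
        path = nx.shortest_path(G, ("row", i), ("col", j), weight="weight")
        for u, v in zip(path, path[1:]):
            suppressed.add(G[u][v]["cell"])  # suppress every cell on the path
        return suppressed

In the real implementation this step is solved as a min-cost network flow problem,
which is where the external solvers mentioned above come in.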
To deal with large tables, a replacement of the network simplex algorithm based on a
specialised interior-point network method will also be developed. This new algorithm
could be selected by the user of τ-ARGUS as a new option of the package.
The interior-point network solver should be more efficient than classical network
simplex codes in very large two-dimensional tables (thousands of columns and/or
thousands of rows). To the best of our knowledge, no implementations of such a methodology
are yet available. The implementation would be developed in C or C++, and it would be
integrated in τ-ARGUS. An external solver could be used for the crossover stage
(i.e., recovering a basic solution from a nonbasic one).
The second goal is to study how to apply new optimisation methodologies to the
higher-dimensional interval cell suppression problem. These methodologies would attempt
to solve a continuous formulation of the cell suppression problem. This problem is
usually formulated as an integer programming problem, guaranteeing its optimal
solution at the expense of computational efficiency. In this task we would attempt
to formulate a continuous (relaxed) problem, which would be solved through a
multicommodity (linear or nonlinear) solver.
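For illustration only (the notation below is ours and not fixed by this document),
the multicommodity structure can arise by letting each sensitive cell k define a
commodity whose flow f^k must ship the required protection d^k through the table
network, the commodities being coupled only through the suppression variables x:

    \begin{aligned}
    \min\ & \textstyle\sum_i w_i x_i \\
    \text{s.t. } & N f^k = d^k && \text{(flow conservation for each sensitive cell } k\text{)}, \\
    & 0 \le f^k_i \le u_i x_i && \text{(flow only through suppressed cells)}, \\
    & x_i \in [0,1] && \text{(the continuous relaxation of } x_i \in \{0,1\}\text{)},
    \end{aligned}

where N is the node-arc incidence matrix of the table network. Solving this
relaxed problem is precisely where a (linear or nonlinear) multicommodity flow
solver would enter.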
Milestones and expected results
First, an implementation of the network flow approach for secondary cell
suppression, integrated in τ-ARGUS; second, a study of the improved approaches
based on interior-point and multicommodity flow methods and, where successful,
their implementation.